Probabilistic Frequent Pattern Growth for 
Itemset Mining in Uncertain Databases 
(Technical Report) 



Thomas Bernecker, Hans-Peter Kriegel, Matthias Renz, Florian Verhein* 

and Andreas Züfle 
{bernecker, kriegel, renz, verhein, zuefle}@dbs.ifi.lmu.de 

Institute for Informatics, Ludwig-Maximilians-Universität München, Germany 



Abstract. Frequent itemset mining in uncertain transaction databases 
semantically and computationally differs from traditional techniques ap- 
plied on standard (certain) transaction databases. Uncertain transaction 
databases consist of sets of existentially uncertain items. The uncertainty 
of items in transactions makes traditional techniques inapplicable. In this 
paper, we tackle the problem of finding probabilistic frequent itemsets 
based on possible world semantics. In this context, an itemset X is called 
frequent if the probability that X occurs in at least minSup transactions is 
above a given threshold $\tau$. We make the following contributions: We propose
the first probabilistic FP-Growth algorithm (ProFP-Growth) and 
associated probabilistic FP-Tree (ProFP-Tree), which we use to mine all 
probabilistic frequent itemsets in uncertain transaction databases with- 
out candidate generation. In addition, we propose an efficient technique 
to compute the support probability distribution of an itemset in linear 
time using the concept of generating functions. An extensive experimental
section evaluates our proposed techniques and shows that our 
ProFP-Growth approach is significantly faster than the current state-of- 
the-art algorithm. 



1 Introduction 

Association rule analysis is one of the most important fields in data mining. It is 
commonly applied to market-basket databases for analysis of consumer purchas- 
ing behavior. Such databases consist of a set of transactions, each containing the 
items a customer purchased. The most important and computationally intensive 
step in the mining process is the extraction of frequent itemsets - sets of items 
that occur in at least minSup transactions. It is generally assumed that the items 
occurring in a transaction are known for certain. However, this is not always the 
case. For instance: 

— In many applications the data is inherently noisy, such as data collected by 
sensors or in satellite images. 



* Contact author, verhein@dbs.ifi.lmu.de or http://www.florian.verhein.com/contact/ 



— In privacy protection applications, artificial noise can be added deliberately 
[19]. Finding patterns despite this noise is a challenging problem. 

— By aggregating transactions by customer, we can mine patterns across cus- 
tomers instead of transactions. This produces estimated purchase probabil- 
ities per item per customer rather than certain items per transaction. 

In such applications, the information captured in transactions is uncertain 
since the existence of an item is associated with a likelihood measure or existen- 
tial probability. Given an uncertain transaction database, it is not obvious how 
to identify whether an item or itemset is frequent because we generally cannot 
say for certain whether an itemset appears in a transaction. In a traditional 
(certain) transaction database on the other hand, we simply perform a database 
scan and count the transactions that include the itemset. This does not work in 
an uncertain transaction database. 

An example of a small uncertain transaction database is given in Figure 1, 
where for each transaction $t_j$, each item $x$ is listed with its probability of existing 
in $t_j$. Items with an existential probability of zero can be omitted. We will use 
this dataset as a running example. 

Prior to [6], expected support was used to deal with uncertain databases [7, 8]. 
It was shown in [3] that the use of expected support in probabilistic databases 
had significant drawbacks which led to misleading results. The proposed alter- 
native was based on computing the entire probability distribution of itemsets' 
support, and achieved this in the same runtime as the expected support approach 
by employing the Poisson binomial recurrence relation. [6] adopts an Apriori-like 
approach, which is based on an anti-monotone Apriori property [5] (if an itemset 
$X$ is not frequent, then any itemset $X \cup Y$ is not frequent) and candidate 
generation. 

However, it is well known that Apriori-like algorithms suffer a number of 
disadvantages. First, all candidates generated must fit into main memory and 
the number of candidates can become prohibitively large. Secondly, checking 
whether a candidate is a subset of a transaction is non-trivial. Finally, the entire 
database needs to be scanned multiple times. In uncertain databases, the effective 
transaction width is typically larger than in a certain transaction database which 
in turn can increase the number of candidates generated and the resulting space 
and time costs. 

In certain transaction databases, the FP-Growth Algorithm [11] has become 
the established alternative. By building an FP-Tree - effectively a compressed 
and highly indexed structure storing the information in the database - candidate 
generation and multiple database scans can be avoided. However, extending this 
idea to mining probabilistic frequent patterns in uncertain transaction databases 
is non-trivial. It should be noted that previous extensions of FP-Growth to 
uncertain databases used the expected support approach [1, 14]. This is much 
easier since these approaches ignore the probability distribution of support. 

In this paper, we propose a compact data structure called the probabilistic 
frequent pattern tree (ProFP-tree) which compresses probabilistic databases and 
allows the efficient extraction of the existence probabilities required to compute 



the support probability distribution and frequentness probability. Additionally 
we propose the novel ProFPGrowth algorithm for mining all probabilistic fre- 
quent itemsets without candidate generation. 



TID   Transaction
1     (A, 1.0), (B, 0.2), (C, 0.5)
2     (A, 0.1), (D, 1.0)
3     (A, 1.0), (B, 1.0), (C, 1.0), (D, 0.4)
4     (A, 1.0), (B, 1.0), (D, 0.5)
5     (B, 0.1), (C, 1.0)
6     (C, 0.1), (D, 0.5)
7     (A, 1.0), (B, 1.0), (C, 1.0)
8     (A, 0.5), (B, 1.0)


Fig. 1. Uncertain Transaction Database (running example) 



1.1 Uncertain Data Model 

The uncertain data model applied in this paper is based on the possible worlds 
semantics with existentially uncertain items. 

Definition 1 An uncertain item is an item $x \in I$ whose presence in a transaction 
$t \in T$ is defined by an existential probability $P(x \in t) \in (0, 1)$. A certain 
item is an item where $P(x \in t) \in \{0, 1\}$. $I$ is the set of all possible items. 

Definition 2 An uncertain transaction t is a transaction that contains uncer- 
tain items. A transaction database T containing uncertain transactions is called 
an uncertain transaction database. 

An uncertain transaction $t$ is represented in an uncertain transaction database 
by the items $x \in I$ associated with an existential probability value 
$P(x \in t) \in (0, 1]$.¹ An example of an uncertain transaction database is depicted in 
Figure 1. To interpret an uncertain transaction database we apply the possible world 
model. An uncertain transaction database generates possible worlds, where each 
world is defined by a fixed set of (certain) transactions. A possible world is 
instantiated by generating each transaction $t_i \in T$ according to the occurrence 
probabilities $P(x \in t_i)$. Consequently, each probability $0 < P(x \in t_i) < 1$ 
derives two possible worlds per transaction: one possible world in which $x$ exists 
in $t_i$, and one possible world where $x$ does not exist in $t_i$. Thus, the number 
of possible worlds of a database increases exponentially in both the number of 
transactions and the number of uncertain items contained in it. Each possible 
world $w$ is associated with a probability that that world exists, $P(w)$. 

¹ If an item $x$ has an existential probability of zero, it does not appear in the 
transaction. 

We assume that uncertain transactions are mutually independent. This as- 
sumption is reasonable in real world applications. Additionally, independence 
between items is often assumed in the literature [7, 8]. This can be justified by 
the assumption that the items are observed independently. In this case, the 
probability of a world w is given by: 

$$P(w) = \prod_{t \in T} \Big( \prod_{x \in t} P(x \in t) \cdot \prod_{x \notin t} \big(1 - P(x \in t)\big) \Big)$$

In cases where this assumption does not hold and conditional probabilities 
are available they may be used in our methods. 

Example 1. In the database of Figure 1, the probability of the world existing in 
which $t_1$ contains only items A and C and $t_2$ contains only item D is 
$P(A \in t_1) \cdot (1 - P(B \in t_1)) \cdot P(C \in t_1) \cdot (1 - P(A \in t_2)) \cdot P(D \in t_2) = 1.0 \cdot 0.8 \cdot 0.5 \cdot 0.9 \cdot 1.0 = 0.36$. 
For simplicity we omit the consideration of other customers in this example. 
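
The computation in Example 1 can be reproduced with a short sketch. The dictionary layout and the function name `world_probability` are illustrative assumptions and not part of the original method; items and transactions are assumed to be independent, as stated above.

```python
def world_probability(db, world):
    """Probability of one possible world, assuming independent items and transactions.

    db:    {tid: {item: existential probability}}
    world: {tid: set of items that are present in that transaction in this world}
    """
    p = 1.0
    for tid, items in db.items():
        present = world.get(tid, set())
        for item, prob in items.items():
            # A present item contributes P(x in t), an absent one 1 - P(x in t).
            p *= prob if item in present else (1.0 - prob)
    return p


# Transactions t1 and t2 of the running example (Figure 1).
db = {
    1: {"A": 1.0, "B": 0.2, "C": 0.5},
    2: {"A": 0.1, "D": 1.0},
}
world = {1: {"A", "C"}, 2: {"D"}}
print(round(world_probability(db, world), 2))  # 0.36
```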

1.2 Problem Definition 

An itemset is a frequent itemset if it occurs in at least minSup transactions, 
where minSup is a user specified parameter. In uncertain transaction databases 
however, the support of an itemset is uncertain; it is defined by a discrete 
probability distribution function (p.d.f.). Therefore, each itemset has a frequentness 
probability, that is, the probability that it is frequent. In this paper, we focus on the 
two distinct problems of efficiently calculating this p.d.f. and efficiently 
extracting all probabilistic frequent itemsets: 

Definition 3 A Probabilistic Frequent Itemset (PFI) is an itemset with a 
frequentness probability of at least $\tau$. 

The parameter $\tau$ is the user specified minimum confidence in the frequentness 
of an itemset. 

We are now able to specify the Probabilistic Frequent Itemset Mining (PFIM) 
problem as follows: Given an uncertain transaction database $T$, a minimum 
support scalar minSup and a frequentness probability threshold $\tau$, find all 
probabilistic frequent itemsets. 
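
Under possible world semantics, the definition can be written out explicitly. The symbols $W$ (the set of all possible worlds of $T$) and $Sup_w(X)$ (the support of $X$ in world $w$) are introduced here only for illustration and are not notation taken from the paper:

```latex
P(X \text{ is frequent})
  = P\big(Sup(X) \ge minSup\big)
  = \sum_{w \in W,\; Sup_w(X) \ge minSup} P(w),
\qquad
X \text{ is a PFI} \iff P\big(Sup(X) \ge minSup\big) \ge \tau .
```

The sum ranges over exponentially many worlds, which is why the later sections avoid enumerating them.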

1.3 Contributions 

We make the following contributions: 

— We introduce the probabilistic Frequent Pattern Tree, or ProFP-Tree, which 
is the first FP-Tree type approach for handling uncertain or probabilistic 
data. This tree efficiently stores a probabilistic database and enables efficient 
extraction of itemset occurrence probabilities and database projections. 

— We propose ProFP-Growth, an algorithm based on the ProFP-Tree which 
mines all itemsets that are frequent with a probability of at least $\tau$ without 
using expensive candidate generation. 

— We present an intuitive and efficient method based on generating functions 
for computing the probability that an itemset is frequent, as well as the 
entire probability distribution function of the support of an itemset, in $O(|T|)$ 
time. Using our approach, our algorithm has the same time complexity as 
the approach based on the Poisson Binomial Recurrence (denoted as dynamic 
programming technique) in [6], but it is much more intuitive and thus offers 
various advantages, as we will show. 

The remainder of this paper is organized as follows: Section 2 surveys related 
work. In Section 3 we present the ProFP-Tree, explain how it is constructed and 
briefly introduce the concept of conditional ProFP-Trees. Section 4 describes how 
probability information is extracted from a (conditional) ProFP-Tree. Section 
5 introduces our generating function approach for computing the frequentness 
probability and the support probability distribution in linear time. Section 6 
describes how conditional ProFP-Trees are built. Finally, Section 7 describes 
the ProFP-Growth algorithm by drawing together the previous sections. We 
present our experiments in Section 8 and conclude in Section 9. 

2 Related Work 

There is a large body of research on Frequent Itemset Mining (FIM) but very 
little work addresses FIM in uncertain databases [7, 8, 13]. The approach proposed 
by Chui et al. [5] computes the expected support of itemsets by summing all 
itemset probabilities in their U-Apriori algorithm. Later, in [7], they 
additionally proposed a probabilistic filter in order to prune candidates early. In [13], the 
UF-growth algorithm is proposed. Like U-Apriori, UF-growth computes frequent 
itemsets by means of the expected support, but it uses the FP-tree approach 
in order to avoid expensive candidate generation. In contrast to our probabilis- 
tic approach, itemsets are considered frequent if the expected support exceeds 
minSup. The main drawback of this estimator is that information about the 
uncertainty of the expected support is lost; [8, 13] ignore the number of possible 
worlds in which an itemset is frequent. [21] proposes exact and sampling-based 
algorithms to find likely frequent items in streaming probabilistic data. However, 
they do not consider itemsets with more than one item. The current 
state-of-the-art (and only) approach for probabilistic frequent itemset mining (PFIM) 
in uncertain databases was proposed in [6]. Their approach uses an Apriori-like 
algorithm to mine all probabilistic frequent itemsets and the Poisson binomial 
recurrence to compute the support probability distribution function (SPDF). 
We provide a faster solution by proposing the first probabilistic frequent pattern 
growth approach (ProFP-Growth), thus avoiding expensive candidate genera- 
tion and allowing us to perform PFIM in large databases. Furthermore, we use 
a more intuitive generating function method to compute the SPDF. 

Existing approaches in the field of uncertain data management and mining 
can be categorized into a number of research directions. Most related to our work 
are the two categories "probabilistic databases" [5I16I17B] and "probabilistic query 
processing" [9, 12, 20, 18]. 

The uncertainty model used in our approach is very close to the model used 
for probabilistic databases. A probabilistic database denotes a database 
composed of relations with uncertain tuples [9], where each tuple is associated with 
a probability denoting the likelihood that it exists in the relation. This model, 
called "tuple uncertainty", adopts the possible worlds semantics [4]. A probabilistic 
database represents a set of possible "certain" database instances (worlds), 
where a database instance corresponds to a subset of uncertain tuples. Each 
instance (world) is associated with the probability that the world is "true". The 
probabilities reflect the probability distribution of all possible database instances. 
In the general model description [17], the possible worlds are constrained by rules 
that are defined on the tuples in order to incorporate object (tuple) correlations. 
The ULDB model proposed in jS], which is used in Trio [2], supports uncertain 
tuples with alternative instances which are called x-tuples. Relations in ULDB 
are called x-relations containing a set of x-tuples. Each x-tuple corresponds to a 
set of tuple instances which are assumed to be mutually exclusive, i.e. no more 
than one instance of an x-tuple can appear in a possible world instance at the 
same time. Probabilistic top-k query approaches [18, 20, 16] are usually associated 
with uncertain databases using the tuple uncertainty model. The approach 
proposed in [20] was the first approach able to solve probabilistic queries efficiently 
under tuple independence by means of dynamic programming techniques. 
Recently, a novel approach was proposed in [15] to solve a wide class of queries in 
the same time complexity, but in a more elegant and also more powerful way us- 
ing generating functions. In our paper, we adopt the generating function method 
for the efficient computation of frequent itemsets in a probabilistic way. 

3 Probabilistic Frequent-Pattern Tree (ProFP-tree) 

In this Section we introduce a novel prefix-tree structure that enables fast detec- 
tion of probabilistic frequent itemsets without the costly candidate generation 
or multiple database scans that plague Apriori-style algorithms. The proposed 
structure is based on the frequent-pattern tree (FP-tree [11]). In contrast to the 
FP-tree, the ProFP-tree has the ability to compress uncertain and probabilis- 
tic transactions. If a dataset contains no uncertainty it reduces to the (certain) 
FP-Tree. 

Definition 4 (ProFP-tree) A probabilistic frequent pattern tree is composed 
of the following three components: 

1. Uncertain item prefix tree: A root labelled "null" pointing to a set of 
prefix trees each associated with uncertain item sequences. Each node n in 
a prefix tree is associated with an (uncertain) item and consists of five 
fields: 

— n.item denotes the item label of the node. Let path(n) be the set of items 
on the path from root to n. 

— n.count is the number of certain occurrences of path(n) in the database. 

— n.uft, denoting "uncertain-from-this", is a set of transaction ids (tids). 
A transaction t is contained in uft if and only if n.item is uncertain in 
t (i.e. $0 < P(n.item \in t) < 1$) and $P(path(n) \subseteq t) > 0$. 

— n.ufp, denoting "uncertain-from-prefix", is a set of transaction ids. A 
transaction t is contained in ufp if and only if n.item is certain in t 
($P(n.item \in t) = 1$) and $0 < P(path(n) \subseteq t) < 1$. 

— n.node-link links to the next node in the tree with the same item label 
if there exists one. 

2. Item header table: This table maps all items to the first node in the 
uncertain item prefix tree. 

3. Uncertain-item lookup table: This table maps (item, tid) pairs to the 
probability that item appears in $t_{tid}$ for each transaction $t_{tid}$ contained in a uft 
of a node n with n.item = item. 

The two sets, uft and ufp, are specialized fields required in order to handle 
the existential uncertainty of itemsets in transactions associated with path(n). 
We need two sets in order to distinguish where the uncertainty of an itemset 
(path) comes from. Generally speaking, the entries in n.uft are used to keep 
track of existential uncertainties where the uncertainty is caused by n.item, 
while the entries in ufp keep track of uncertainties of itemsets caused by items 
in path(n) \ {n.item} but where n.item is certain. 

Figure 2 illustrates the ProFP-tree of our example database of Figure 1. Each 
node of the uncertain item prefix tree is labelled by the field item. The labels 
next to the nodes refer to the node fields count: uft ufp. The dotted lines denote 
the node-links. 

The ProFP-tree has the same advantages as an FP-tree. First, it avoids 
repeatedly scanning the database since the uncertain item information is 
efficiently stored in a compact structure. Secondly, multiple transactions sharing 
identical prefixes can be merged into one with the number of certain occurrences 
registered by count and the uncertain occurrences reflected in the transaction 
sets uft and ufp. 

3.1 ProFP-Tree Construction 

For further illustration, we refer to our example database of Figure 1 and the 
corresponding ProFP-tree in Figure 2. We assume that the (uncertain) items in 
the transactions are lexicographically ordered, which is required for prefix tree 
construction. 

We first create the root of the uncertain item prefix tree labelled "null". 
Then we read the uncertain transactions one at a time. While scanning the 
first transaction $t_1$, the first branch of the tree can be generated leading to the 
first path composed of entries of the form (item, count, uft, ufp, node-link). In our 
example, the first branch of the tree is built by the following path: 




<root, (A, 1, [], [], null), (B, 0, [1], [], null), (C, 0, [1], [], null)>. 

[Figure 2: (a) Uncertain item prefix tree with item header table. (b) Uncertain-item lookup table.] 

Fig. 2. ProFP-Tree generated from the uncertain transaction database given in 
Figure 1. 

Note that the entry "1" in the field uft of the nodes associated with B and 
C indicates that items B and C are uncertain in $t_1$. 

Next, we scan the second transaction $t_2$ and update the tree structure 
accordingly. The itemset of transaction $t_2$ shares its prefix with the previous one, 
therefore we follow the existing path in the tree starting at the root. Since the 
first item in $t_2$ is existentially uncertain, i.e. it exists in $t_2$ with a probability of 
0.1, count of the first node in the path is not incremented. Instead, the current 
transaction $t_2$ is added to uft of this node. The next item in $t_2$ does not match 
with the next node on the path and, thus, we have to build a new branch leading 
to the leaf node N with the entry (D, 0, [], [2], null). Although item D is 
existentially certain in $t_2$, count of N is initialized with zero, because the itemset {A, D} 
associated with the path from the root to node N is existentially uncertain in $t_2$ 
due to the existential uncertainty of item A. Hence, we add transaction $t_2$ to the 
uncertain-from-prefix (ufp) field of N. The resulting tree is illustrated in Figure 3(a). 




[Figure 3: (a) After inserting $t_1$ and $t_2$. (b) After inserting $t_1$, $t_2$ and $t_3$.] 

Fig. 3. Uncertain item prefix tree after insertion of the first transactions. 

The next transaction to be scanned is transaction $t_3$. Again, due to 
matching prefixes we follow the already existing path <A, B, C> while scanning the 
(uncertain) items in $t_3$. The resulting tree is illustrated in Figure 3(b). Since the 
first item A is existentially certain, count of the first node in the prefix path is 
incremented by one. The next items, B and C, are registered in the tree 
in the same way by incrementing the count fields. The rationale for these count 
increments is that the corresponding itemsets are existentially certain in $t_3$. The 
final item D is processed by adding a new branch below the node C leading to 
a new leaf node with the fields (D, 0, [3], [], ptr), where the link ptr points to the 
next node in the tree labelled with item label D. Since item D is existentially 
uncertain in $t_3$, the count field is initialized with 0 and $t_3$ is registered in the 
uft set. The uncertain item prefix tree is completed by scanning all remaining 
transactions in a similar fashion. 

The ProFP-tree construction algorithm is shown in Algorithm 1. 
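
The construction walk-through above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `Node` and `build_profp_tree` are assumptions, the item header table is simplified to a dictionary of node lists (standing in for the node-link pointers), and the uncertain-item lookup table is a dictionary keyed by (item, tid).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    item: Optional[str]
    count: int = 0                              # certain occurrences of path(n)
    uft: Set[int] = field(default_factory=set)  # tids in which n.item itself is uncertain
    ufp: Set[int] = field(default_factory=set)  # tids in which only the prefix is uncertain
    children: Dict[str, "Node"] = field(default_factory=dict)


def build_profp_tree(db):
    """db: {tid: {item: existential probability in (0, 1]}}."""
    root = Node(None)
    header = {}  # item -> list of nodes carrying that item (simplified node-links)
    ult = {}     # (item, tid) -> probability, for uncertain entries only
    for tid, items in db.items():
        node, uncertain_prefix = root, False
        for item in sorted(items):              # lexicographic item order
            prob = items[item]
            child = node.children.get(item)
            if child is None:                   # create a new branch
                child = Node(item)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            if prob < 1.0:                      # uncertain from this item
                child.uft.add(tid)
                ult[(item, tid)] = prob
                uncertain_prefix = True
            elif uncertain_prefix:              # certain item, uncertain prefix
                child.ufp.add(tid)
            else:                               # whole path is certain
                child.count += 1
            node = child
    return root, header, ult


# Running example of Figure 1.
db = {
    1: {"A": 1.0, "B": 0.2, "C": 0.5},
    2: {"A": 0.1, "D": 1.0},
    3: {"A": 1.0, "B": 1.0, "C": 1.0, "D": 0.4},
    4: {"A": 1.0, "B": 1.0, "D": 0.5},
    5: {"B": 0.1, "C": 1.0},
    6: {"C": 0.1, "D": 0.5},
    7: {"A": 1.0, "B": 1.0, "C": 1.0},
    8: {"A": 0.5, "B": 1.0},
}
root, header, ult = build_profp_tree(db)
a = root.children["A"]
print(a.count, sorted(a.uft), sorted(a.ufp))  # 4 [2, 8] []
```

The boolean `uncertain_prefix` plays the role of a per-transaction flag that is set once the first uncertain item of the transaction has been seen.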

3.2 Construction Analysis 

The construction of the ProFP-tree requires a single scan of the uncertain trans- 
action database T. For each processed transaction we must follow and update 
or construct a single path of the tree, of length equal to the number of items in 
the corresponding transaction. Therefore the ProFP-tree is constructed in linear 
time w.r.t. the size of the database. 

Since the ProFP-tree is based on the original FP-tree, it inherits its 
compactness properties. In particular, the size of a ProFP-tree is bounded by the 
overall occurrences of the (un)certain items in the database and its height is 
bounded by the maximal number of (un)certain items in a transaction. For any 
transaction $t_i$ in $T$, there exists exactly one path in the uncertain item prefix 
tree starting below the root node. Each item in the transaction database can 
create no more than one node in the tree and the height of the tree is bounded 
by the number of items in a transaction (path). Note that as with the FP-Tree, 
the compression is obtained by sharing common prefixes. 

Algorithm 1 ProFP-Tree Construction. 

Input: An uncertain transaction database T with lexicographically ordered items, 
       and a minimum support threshold minSup. 
Output: A probabilistic frequent pattern tree (ProFP-Tree). 
Method: 
  Create the (null) root of an uncertain item prefix tree PT; 
  Initialize an empty item header table (iht); 
  Initialize an empty uncertain-item lookup table (ult); 
  for each uncertain transaction t_i ∈ T 
    Build a string <it_1, ..., it_n> of tuples it_j = (item, prob), 
      where the field item identifies a(n) (un)certain item of t_i 
      and the field prob denotes the probability P(it_j.item ∈ t_i). 
    Call insert-transaction(<it_1, ..., it_n>, i, PT.root, 0) 

insert-transaction(transaction, i, node, u_flag) 
  while it := transaction.get_next_item() not null do 
    if node has a child N with N.item = it.item then 
      call update-node-entries(it, i, N, u_flag); //follow existing path 
    else //create new branch: 
      create new child N of node; 
      call update-node-entries(it, i, N, u_flag); 
      if it.item not in iht then 
        insert (it.item, ptr(N)) into iht; 
      else 
        insert node N into the link list associated with it.item; 
    //update uncertain-item lookup table 
    if it.prob < 1.0 then 
      insert (i, it.item, it.prob) into ult; 
    node := N; 

update-node-entries(it, i, N, u_flag) //u_flag is shared with the caller 
  if it.prob = 1.0 then 
    if u_flag = 0 then 
      increment N.count by 1; 
    else //u_flag = 1 
      insert i into N.ufp; 
  else 
    insert i into N.uft; 
    set u_flag := 1; 

We now show that the values stored at the nodes do not affect the bound 
on the size of the tree. In particular, in the following Lemma we bound the 
uncertain-from-this (uft) and uncertain-from-prefix (ufp) sets. 

Lemma 5 Let $\mathcal{T}$ be the uncertain item prefix tree generated from an uncertain 
transaction database $T$. The total space required by all the transaction-id sets 
(uft and ufp) in all nodes in $\mathcal{T}$ is bounded by the total number of uncertain 
occurrences in $T$. 

The rationale for the above lemma is that each occurrence of an uncertain 
item (with existence probability in (0, 1)) in the database yields at most one 
transaction-id entry in one of the transaction-id sets assigned to a node in the 
tree. In general there are three update possibilities for a node N: If the current 
item and all prefix items in the current transaction $t_i$ are certain, there is no 
new entry in uft or ufp as count is incremented. $t_i$ is registered in N.uft if and 
only if N.item is existentially uncertain in $t_i$, while $t_i$ is registered in N.ufp if 
and only if N.item is existentially certain in $t_i$ but at least one of the prefix 
items in $t_i$ is existentially uncertain. Therefore each occurrence of an item in $T$ 
leads to either a count increment or a new entry in uft or ufp. 

Finally, it should be clear that the size of the uncertain item lookup table 
is bounded by the number of uncertain (non zero and non 1) entries in the 
database. 

In this section we showed that the ProFP-Tree inherits the compactness of the 
original FP-Tree. In the following Section we show that the information stored 
in the ProFP-tree suffices to retrieve all probabilistic information required for 
PFIM, thus proving completeness. 

4 Extracting Certain and Uncertain Support 
Probabilities 

Unlike the (certain) FP-Growth approach where extracting the support of an 
itemset X is easily achieved by summing the support counts along the node- 
links for X in a suitable conditional ProFPTree, we are interested in the support 
distribution of X in the probabilistic case. Before we can compute this however, 
we first require both the number of certain occurrences as well as the probabilities 
$0 < P(X \subseteq t_i) < 1$. Both can be efficiently obtained using the ProFP-Tree as 
follows: 

To obtain the certain support of an item x, follow the node-links from the 
header table and accumulate both the counts and the number of transactions in 
which x is uncertain-from-prefix. The latter is counted since we are interested 
in the support of x and by construction, transactions in ufp are known to be 
certain for x. To find the set of transaction ids in which x is uncertain, follow the 
node-links and accumulate all transactions that are in the uncertain-from-this 
(uft) list. 

Example 2. By traversing the node-list, we can calculate the certain support for 
item C in the ProFP-Tree in Figure 2 as follows: $2 + |\emptyset| + |\{t_5\}| + |\emptyset| = 3$. 
Note there is one transaction in which C is uncertain-from-prefix ($t_5$). Similarly, 
we find that the only transactions in which C is uncertain are $t_1$ and $t_6$. The 
exact appearance probabilities in these transactions can be obtained from the 
uncertain-item lookup table. By comparing this to Figure 1 we see that the tree 
allows us to obtain the correct certain support and the transaction ids where C 
is uncertain. 

To compute the support of an itemset $X = \{a, \ldots, k\}$, we use the conditional 
tree for items $b, \ldots, k$ and extract the certain support and uncertain transaction 
ids for $a$. Since it is somewhat involved, we defer the construction of conditional 
ProFP-Trees to Section 6. By using the conditional tree, the above method 
provides the certain support of $X$ and the exact set of transaction ids in which 
$X$ is uncertain (utids). To compute the probabilities $P(X \subseteq t_i)$ for $t_i \in utids$ we 
use the independence assumption and multiply, for each $x \in X$, the probability 
that $x$ appears in $t_i$. Recall that the probability that $x$ appears in $t_i$ is an $O(1)$ 
lookup in the uncertain-item lookup table. If additional information 
is given on the dependencies between items, it can be incorporated here. 

We have now described how the certain support and all probabilities 
$P(X \subseteq t)$ for transactions $t$ in which $X$ is uncertain can be efficiently computed 
from the ProFP-Tree (Algorithm 2). Section 5 shows how we use this information 
to calculate the support distribution of X. 
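
The two extraction steps of this section can be sketched as follows. The function names, the representation of a node as a `(count, uft, ufp)` tuple and the dictionary used for the lookup table are illustrative assumptions; the node records for item C were written out by hand from the running example and are not taken from the paper. The sketch also assumes that every item of the itemset occurs in each of the given transactions, so that a missing lookup-table entry means probability 1.

```python
from math import prod


def extract(nodes):
    """Certain support and uncertain tids of one item.

    nodes: iterable of (count, uft, ufp) records reached via the item's node-links.
    """
    cert_sup, uncertain_tids = 0, set()
    for count, uft, ufp in nodes:
        cert_sup += count + len(ufp)  # in ufp transactions the item itself is certain
        uncertain_tids |= uft
    return cert_sup, uncertain_tids


def calculate_probabilities(itemset, tids, ult):
    """P(X in t) for every t in tids, assuming independent items."""
    # Items without a lookup-table entry for t are treated as certain in t.
    return {t: prod(ult.get((x, t), 1.0) for x in itemset) for t in sorted(tids)}


# The three nodes carrying item C in the running example (cf. Example 2).
c_nodes = [(2, {1}, set()), (0, set(), {5}), (0, {6}, set())]
cert_sup, tids = extract(c_nodes)
print(cert_sup, sorted(tids))  # 3 [1, 6]

# Uncertain entries of items A and D in Figure 1.
ult = {("A", 2): 0.1, ("A", 8): 0.5, ("D", 3): 0.4, ("D", 4): 0.5, ("D", 6): 0.5}
print(calculate_probabilities({"A", "D"}, {2, 3, 4}, ult))  # {2: 0.1, 3: 0.4, 4: 0.5}
```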

5 Efficient Computation of Probabilistic Frequent 
Itemsets 

This section presents our linear-time technique for computing the probabilistic 
support of an itemset using generating functions. The problem is as follows: 

Definition 6 Given a set of N mutually independent but not necessarily 
identical Bernoulli (0/1) random variables $X_i$ with $P(X_i = 1) = P(X \subseteq t_i)$, 
$1 \le i \le N$, compute the probability distribution of the random variable $Sup = \sum_{i=1}^{N} X_i$. 

A naive solution would be to count for each $0 \le k \le N$ all possible worlds in 
which exactly k transactions contain X and accumulate the respective probabilities. 
This approach, however, shows a complexity of $O(2^N)$. In [6] an approach has 
been proposed that achieves an $O(N)$ complexity using Poisson Binomial 
Recurrence. Note that $O(N)$ time is asymptotically optimal in general, since the 
computation involves at least $O(N)$ computations, namely $P(X \subseteq t_i)\ \forall\, 1 \le i \le N$. 
In the following, we propose a different approach that, albeit having the same 
linear asymptotical complexity, has other advantages. 
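
For reference, the naive enumeration can be sketched as below. It is exponential in N and only practical for tiny inputs; the function name and the input list are illustrative assumptions. The probabilities are those of itemset {A, D} in the eight transactions of Figure 1, assuming independent items.

```python
from itertools import product


def support_distribution_naive(probs):
    """Support distribution by enumerating all 2^N worlds; probs[i] = P(X in t_i)."""
    n = len(probs)
    dist = [0.0] * (n + 1)
    for world in product((0, 1), repeat=n):  # 1 means X occurs in that transaction
        p = 1.0
        for present, q in zip(world, probs):
            p *= q if present else (1.0 - q)
        dist[sum(world)] += p                # sum(world) is the support in this world
    return dist


# P({A, D} in t_i) for t_1, ..., t_8 of the running example.
probs = [0.0, 0.1, 0.4, 0.5, 0.0, 0.0, 0.0, 0.0]
dist = support_distribution_naive(probs)
print([round(v, 2) for v in dist[:4]])  # [0.27, 0.48, 0.23, 0.02]
print(round(sum(dist[2:]), 2))          # 0.25
```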



Algorithm 2 Extract Probabilities for an itemset. 

//calculate the certain support and the uncertain transaction ids of an item 
//derived from a ProFP-Tree 
extract(item, ProFP-Tree tree) 
  certSupp = 0; uncertainSupTids = ∅; 
  for each ProFPNode n in tree reachable 
      from headerTable[item] 
    certSupp += n.count; 
    certSupp += |n.ufp|; 
    uncertainSupTids = uncertainSupTids ∪ n.uft; 
  return certSupp, uncertainSupTids; 

//calculate the existential probabilities of an itemset 
calculateProbabilities(itemset, uncertainSupTids) 
  probabilityVector = ∅; 
  for (t ∈ uncertainSupTids) 
    p = ∏_{i ∈ itemset} uncertainItemLookupTable[i, t]; 
    probabilityVector.add(p); 
  return probabilityVector; 



5.1 Efficient Computation of Probabilistic Support 

We apply the concept of generating functions as proposed in the context of 
probabilistic ranking in [15]. Consider the function $\mathcal{F}(x) = \prod_{i=1}^{n} (a_i + b_i x)$. 
The coefficient of $x^k$ in $\mathcal{F}(x)$ is given by $\sum_{|\beta| = k} \prod_{i: \beta_i = 0} a_i \prod_{i: \beta_i = 1} b_i$, where 
$\beta = (\beta_1, \ldots, \beta_n)$ is a Boolean vector, and $|\beta|$ denotes the number of 1's in $\beta$. 
Now consider the following generating function: 

$$\mathcal{F}^i = \prod_{t \in \{t_1, \ldots, t_i\}} \big(1 - P(X \subseteq t) + P(X \subseteq t) \cdot x\big) = \sum_{j \in \{0, \ldots, i\}} c_j x^j$$

The coefficient $c_j$ of $x^j$ in the expansion of $\mathcal{F}^i$ is exactly the probability that 
X occurs in exactly j of the first i transactions; that is, the probability that the 
support of X is j in the first i transactions. Since $\mathcal{F}^i$ contains at most $i + 1$ 
nonzero terms and by observing that 

$$\mathcal{F}^i = \mathcal{F}^{i-1} \cdot \big(1 - P(X \subseteq t_i) + P(X \subseteq t_i) \cdot x\big)$$

we note that $\mathcal{F}^i$ can be computed in $O(i)$ time given $\mathcal{F}^{i-1}$. Since $\mathcal{F}^0 = 1x^0 = 1$, 
we conclude that $\mathcal{F}^N$ can be computed in $O(N^2)$ time. To reduce the complexity 
to $O(N)$ we exploit that we only need to consider the coefficients $c_j$ in the 
generating function $\mathcal{F}^i$ where $j < minSup$, since: 

— The frequentness probability of $X$ is defined as
$P(X \text{ is frequent}) = P(Sup(X) \ge minSup) = 1 - P(Sup(X) < minSup) = 1 - \sum_{j=0}^{minSup-1} c_j$.



— A coefficient $c_j$ in $\mathcal{F}^i$ is independent of any $c_k$ in
$\mathcal{F}^{i-1}$ where $k > j$. That means in particular that the coefficients
$c_k$, $k \ge minSup$, are not required to compute the $c_j$, $j < minSup$.

Thus, keeping only the coefficients $c_j$ where $j < minSup$, $\mathcal{F}^i$
contains at most $minSup$ coefficients, leading to a total complexity of
$O(minSup \cdot N)$ to compute the frequentness probability of an itemset.
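The second observation can be read off directly by comparing coefficients on both sides of the product. As a sketch, write $p_i$ for $P(X \subseteq t_i)$ and $c_j^{(i)}$ for the coefficient of $x^j$ in $\mathcal{F}^i$ (both are shorthand introduced here for illustration, not notation from the original). The update step then reads:

```latex
c_j^{(i)} = (1 - p_i)\, c_j^{(i-1)} + p_i\, c_{j-1}^{(i-1)},
\qquad c_{-1}^{(i-1)} = 0, \quad c_0^{(0)} = 1 .
```

Each new coefficient depends only on coefficients of equal or lower index, so discarding every $c_k$ with $k \ge minSup$ leaves the retained coefficients exact, and each of the $N$ steps updates at most $minSup$ values.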

Example 3. As an example, consider itemset $\{A, D\}$ in the running example
database in Figure 1. Using the ProFP-Tree (cf. Figure 2(a)), we can efficiently
extract, for each transaction $t_i$, the probability $P(\{A, D\} \subseteq t_i)$,
where $0 < P(\{A, D\} \subseteq t_i) < 1$, and also the number of certain
occurrences of $\{A, D\}$. Itemset $\{A, D\}$ certainly occurs in no transaction and
occurs in $t_2$, $t_3$ and $t_4$ with a probability of 0.1, 0.4 and 0.5, respectively.
Let $minSup$ be 2:

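$$\begin{aligned}
\mathcal{F}^1 &= \mathcal{F}^0 \cdot (0.9 + 0.1x) = 0.1x^1 + 0.9x^0 \\
\mathcal{F}^2 &= \mathcal{F}^1 \cdot (0.6 + 0.4x) = 0.04x^2 + 0.42x^1 + 0.54x^0 \overset{*}{=} 0.42x^1 + 0.54x^0 \\
\mathcal{F}^3 &= \mathcal{F}^2 \cdot (0.5 + 0.5x) = 0.21x^2 + 0.48x^1 + 0.27x^0 \overset{*}{=} 0.48x^1 + 0.27x^0
\end{aligned}$$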

Thus, $P(sup(\{A, D\}) = 0) = 0.27$ and $P(sup(\{A, D\}) = 1) = 0.48$. We get that
$P(sup(\{A, D\}) \ge 2) = 0.25$. Thus, $\{A, D\}$ is not returned as a frequent
itemset if $\tau$ is greater than 0.25. Equations marked by a * exploit that we only
need to compute the $c_j$ where $j < minSup$.
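The computation in Example 3 can be retraced with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `frequentness_probability` and its arguments are hypothetical, and it assumes `min_sup >= 1` and that `probs` holds the containment probabilities of the uncertain transactions in the order they are processed.

```python
def frequentness_probability(probs, min_sup):
    """Return (P(support >= min_sup), truncated coefficients c_0 .. c_{min_sup-1})."""
    # coeffs[j] = probability that the support equals j over the transactions
    # processed so far; coefficients with j >= min_sup are never stored.
    coeffs = [1.0] + [0.0] * (min_sup - 1)
    for p in probs:
        # Go from high to low index so that coeffs[j - 1] still holds the
        # value from the previous step when coeffs[j] is updated.
        for j in range(min_sup - 1, 0, -1):
            coeffs[j] = coeffs[j] * (1.0 - p) + coeffs[j - 1] * p
        coeffs[0] *= 1.0 - p
    return 1.0 - sum(coeffs), coeffs


prob, coeffs = frequentness_probability([0.1, 0.4, 0.5], min_sup=2)
print(round(prob, 2), [round(c, 2) for c in coeffs])  # 0.25 [0.27, 0.48]
```

Updating the list in place keeps the memory footprint at `min_sup` values per itemset, which matches the truncation marked by * in the example.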

Note that at each iteration of computing $\mathcal{F}^i$, we can check whether
$1 - \sum_{j < minSup} c_j \ge \tau$, and if that is the case, we can stop the
computation and conclude that the respective itemset (for which $\mathcal{F}$ is
the generating function) is frequent. Intuitively, the reason is that if an
itemset $X$ is already frequent considering the first $i$ transactions only, $X$
will still be frequent if more transactions are considered. This intuitive pruning
criterion corresponds to the pruning criterion proposed in [6] for the Poisson
Binomial Recurrence approach.
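A sketch of this early-termination check is given below. The function name, the returned pair, and the parameter names are hypothetical and chosen only for illustration; the sketch assumes `min_sup >= 1` and treats an itemset as frequent when its frequentness probability is at least `tau`.

```python
def is_probabilistic_frequent(probs, min_sup, tau):
    """Return (decision, number of transactions read before deciding)."""
    # Truncated coefficient vector: coeffs[j] = P(support == j), j < min_sup.
    coeffs = [1.0] + [0.0] * (min_sup - 1)
    for read, p in enumerate(probs, start=1):
        for j in range(min_sup - 1, 0, -1):
            coeffs[j] = coeffs[j] * (1.0 - p) + coeffs[j - 1] * p
        coeffs[0] *= 1.0 - p
        # The frequentness probability can only grow as transactions are
        # added, so the first time it reaches tau the answer is final.
        if 1.0 - sum(coeffs) >= tau:
            return True, read
    return False, len(probs)
```

For instance, with probabilities `[0.9, 0.9, 0.9, 0.1, 0.1]`, `min_sup=2` and `tau=0.8`, the check already succeeds after the second transaction, since the frequentness probability at that point is 0.81.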

We remark that the generating function technique can be seen as a variant of
the Poisson Binomial Recurrence. However, using generating functions instead of
the complicated recursion formula gives us a much cleaner view of the problem.
In addition, using generating functions, the support probability density
function (sPDF) can be updated easily if a transaction $t_i$ changes its
probability of containing an itemset $X$. That is, if the probability
$p = P(X \subseteq t_i)$ changes to $p'$, then we can simply obtain the expanded
polynomial from the old sPDF, divide it by $px + (1 - p)$ (using polynomial
division) to remove the effect of $t_i$, and multiply it by $p'x + (1 - p')$ to
incorporate the new probability of containing $X$. That is,
${\mathcal{F}^i}'(x) = \mathcal{F}^i(x) \div (px + 1 - p) \times (p'x + 1 - p')$,
where ${\mathcal{F}^i}'$ is the generating function of the sPDF of $X$ in the
changed database containing $t_i'$.
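The update by polynomial division can be sketched as follows. The helper names `spdf` and `update_spdf` are hypothetical, the sketch works on the full (untruncated) coefficient list, and it assumes that the old probability is strictly below 1 so that the division can start at the constant term.

```python
def spdf(probs):
    """Full support distribution: coeffs[j] = P(support == j)."""
    coeffs = [1.0]
    for p in probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1.0 - p)
            nxt[j + 1] += c * p
        coeffs = nxt
    return coeffs


def update_spdf(coeffs, p_old, p_new):
    """Swap the factor (p_old*x + 1 - p_old) for (p_new*x + 1 - p_new).

    Assumption: p_old < 1, otherwise the constant term of the divisor is zero.
    """
    q = 1.0 - p_old
    reduced = []   # quotient: the sPDF without the changed transaction
    prev = 0.0     # previously computed quotient coefficient
    for c in coeffs[:-1]:
        prev = (c - p_old * prev) / q
        reduced.append(prev)
    updated = [0.0] * len(coeffs)
    for j, d in enumerate(reduced):
        updated[j] += d * (1.0 - p_new)
        updated[j + 1] += d * p_new
    return updated
```

Both steps are linear in the number of coefficients, so a single changed transaction does not require recomputing the product from scratch. Note that dividing from the low-order end can lose precision when `p_old` is very close to 1; a production implementation might need a more careful scheme in that case.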



6 Extracting Conditional ProFP-Trees 

This section describes how conditional ProFP-Trees are constructed from other
(potentially conditional) ProFP-Trees. The method for doing this is more
involved than the analogous operation for the certain FP-Growth algorithm, since
we must ensure that the information capturing the source of the uncertainty
remains correct, that is, whether the uncertainty at a node comes from the
prefix or from the present node. Recall from Section 4 that this is required in
order to extract the correct probabilities from the tree. A conditional
ProFP-Tree for itemset $X$ ($tree_X$) is equivalent to a ProFP-Tree built on only
those transactions in which $X$ occurs with a non-zero probability. In order to
generate a conditional ProFP-Tree for itemset $X \cup i$ ($tree_{X \cup i}$), where
$i$ occurs lexicographically prior to any item in $X$, we first begin with the
conditional ProFP-Tree for $X$. When $X = \emptyset$, $tree_X$ is simply the
complete ProFP-Tree. We construct $tree_{X \cup i}$ by propagating the values at
the nodes with $item = i$ upwards and accumulating these at the nodes closer to
the root, as listed in Algorithm 3. Let $N_i$ be the set of nodes with $item = i$
(these are obtained by following the links from the header table). The values
for every node $n$ in the resulting conditional tree $tree_{X \cup i}$ are
calculated as follows:



— $n.count = \sum_{n_i \in N_i} n_i.count$, since these represent certain transactions.

— $n.uft = \bigcup_{n_i \in N_i} n_i.uft$, since we are conditioning on an item that is
uncertain in these transactions and hence any node in the final conditional tree
will also be uncertain for these transactions.

— When collecting transactions for $n$ that are uncertain from the prefix (i.e.
$t \in ufp$), we must determine whether the item $n.item$ caused this uncertainty.
If the corresponding node in $tree_X$ contained transaction $t$ in $ufp$, then $t$
is also in $n.ufp$ ($n.item$ was not uncertain in $t$). If $n.item$ was uncertain in
$t$, then the corresponding node in $tree_X$ would have $t$ listed in $uft$ and this
must also remain the case for the conditional tree. If $t \in n.ufp$ is neither in
the corresponding $ufp$ nor $uft$ in $tree_X$, then it must be certain for $n.item$
and $n.count$ is incremented. Using this approach, we can avoid storing the
set of transactions for which an item is certain. This is a key idea in our
ProFP-Tree.



7 ProFP-Growth Algorithm 



We have now described the four fundamental operations of the ProFP-Growth
algorithm: building the ProFP-Tree (Section 3); efficiently extracting the certain
support and uncertain transaction probabilities from it (Section 4); calculating
the frequentness probability and determining whether an item(set) is a
probabilistic frequent itemset (Section 5); and constructing the conditional
ProFP-Trees (Section 6). Together with the fact that probabilistic frequent
itemsets possess an anti-monotonicity property (Lemma 17 in [5]), we can use an
approach similar to the certain FP-Growth algorithm to mine all probabilistic
frequent itemsets. Since, in principle, this is not substantially different from
substituting the corresponding steps in FP-Growth, we omit further details.



Algorithm 3 Construction of a conditional ProFP-Tree tree_{X∪i} by 'extracting'
item i from the conditional ProFP-Tree for itemset X.

//Accumulates transactions for nodes when propagating up the values
//from a node being extracted.
class Accumulator
  count = 0; uft = ∅; ufp = ∅;
  orig_ufp = the original ufp list
  orig_uft = the original uft list
  add(ProFPNode n)
    count += n.count;
    uft = uft ∪ n.uft;
    for (t ∈ n.ufp)
      if (orig_ufp.contains(t)) ufp = ufp ∪ {t};
      else if (orig_uft.contains(t)) uft = uft ∪ {t};
      else count++;

buildConditionalProFPTree(ProFPTree tree_X, item i) returns tree_{X∪i}
  tree_{X∪i} = clone of the subtree of tree_X reachable from header table for i;
  associate an Accumulator with each node in tree_{X∪i} and set orig_ufp
      and orig_uft;
  propagate(tree_{X∪i}, i);
  set the certSup, uft, ufp values of nodes in tree_{X∪i} to those in the
      corresponding Accumulators;

propagate(ProFPTree tree, item i)
  for (ProFPNode n accessible from header table for i)
    ProFPNode cn = n;
    while (cn.parent ≠ null)
      call add(n) on Accumulator for cn;
      cn = cn.parent;
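The bookkeeping performed by the Accumulator can be sketched in Python as follows. This is an illustration of the `add` step only, under stated assumptions: the `ProFPNode` record is a hypothetical stand-in with the fields used in Algorithm 3, and both original lists of the target node are assumed to be available to the Accumulator as `orig_ufp` and `orig_uft`.

```python
from dataclasses import dataclass, field


@dataclass
class ProFPNode:
    """Hypothetical node record; field names follow Algorithm 3."""
    count: int = 0
    uft: set = field(default_factory=set)
    ufp: set = field(default_factory=set)


@dataclass
class Accumulator:
    """Collects the values propagated up to one node of the conditional tree."""
    orig_ufp: frozenset = frozenset()   # ufp of the corresponding node in tree_X
    orig_uft: frozenset = frozenset()   # uft of the corresponding node in tree_X
    count: int = 0
    uft: set = field(default_factory=set)
    ufp: set = field(default_factory=set)

    def add(self, n: ProFPNode) -> None:
        self.count += n.count
        self.uft |= n.uft
        for t in n.ufp:
            if t in self.orig_ufp:
                self.ufp.add(t)       # still uncertain only because of the prefix
            elif t in self.orig_uft:
                self.uft.add(t)       # the target node's own item is uncertain in t
            else:
                self.count += 1       # t is certain for the target node's item
```

The three-way case distinction is what allows the tree to omit an explicit list of certain transactions: a tid that appears in neither original list is counted as certain by elimination.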



8 Experimental Evaluation 

In this section, we present performance experiments using our proposed
ProFP-Growth algorithm and compare the results to the Apriori-based solution
(denoted as ProApriori) presented in [B]. We also analyze how various database
characteristics and parameter settings affect the performance of ProFP-Growth.

All experiments were performed on an Intel Xeon with 32 GB of RAM and
a 3.0 GHz processor. For the first set of experiments, we used artificial datasets
with a variable number of transactions and items. Each item $x$ has a probability
$P_1(x)$ of appearing for certain in a transaction, and a probability $P_0(x)$ of
not appearing at all in a transaction. With a probability $1 - P_0(x) - P_1(x)$,
item $x$ is therefore uncertain in a transaction. In this case, the probability
that $x$ exists in a transaction is picked randomly from a uniform (0, 1)
distribution.

For our scalability experiments, we scaled the number of items and
transactions and chose $P_0(x) = 0.5$ and $P_1(x) = 0.2$ for each item. We measured
the run time required to mine all probabilistic frequent itemsets that have a
minimum support of 10% of the database size with a probability of at least
$\tau = 0.9$.
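The data generation scheme described above can be sketched as follows. This is not the generator used for the reported experiments: the function name, the dictionary representation of a transaction, and the `seed` parameter are assumptions made for illustration.

```python
import random


def generate_uncertain_db(n_transactions, n_items, p0=0.5, p1=0.2, seed=None):
    """Return a list of transactions, each a dict {item: existence probability}."""
    rng = random.Random(seed)
    db = []
    for _ in range(n_transactions):
        transaction = {}
        for item in range(n_items):
            u = rng.random()
            if u < p1:
                transaction[item] = 1.0            # item appears for certain
            elif u < p1 + p0:
                continue                           # item does not appear at all
            else:
                # Uncertain item. Python's random() draws from [0, 1); the
                # zero endpoint has negligible probability and is ignored here.
                transaction[item] = rng.random()
        db.append(transaction)
    return db


db = generate_uncertain_db(n_transactions=1000, n_items=20, seed=42)
```

With the default parameters, roughly 20% of all item slots are certain, 50% are absent, and the remaining 30% carry a random existence probability.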
Fig. 4. Scalability w.r.t. the number of transactions.



8.1 Number of Transactions 

We scaled the number of transactions and used 20 items. The results can be seen
in Figure 4(a). In this setting, our approach significantly outperforms ProApriori
[B]. The time required to build the ProFP-Tree w.r.t. the number of transactions
is depicted in Figure 4(b). The observed linear run time indicates that a constant
time is required to insert each transaction into the tree. This is expected since
the maximum height of the ProFP-Tree is equal to the number of items. Finally,
we evaluated the size of the ProFP-Tree for this experiment, shown in Figure
4(c). The number of nodes in the ProFP-Tree increases and then plateaus as the
number of transactions increases. This is because new nodes have to be created
for those transactions where a suffix of the transaction is not yet contained in
the tree. As the number of transactions increases, the overlap between transaction
prefixes increases, requiring fewer new nodes to be created. It is expected that
this overlap increases faster if the items are correlated. Therefore, we evaluate
the size of the ProFP-Tree on subsets of the real-world dataset accidents¹, denoted
by ACC. It consists of 340,184 transactions and a reduced number of 20 items
whose occurrences in transactions were randomized; with a probability of 0.5,
each item appearing for certain in a transaction was assigned a value drawn from
a uniform distribution in (0, 1]. We varied the number of transactions from ACC
up to the first 300,000. As can be seen in Figure 4(d), there is more overlap
between transactions since the growth in the number of nodes used is slower
(compared to Figure 4(c)).

¹ The accidents dataset [10] was derived from the Frequent Itemset Mining Dataset
Repository (http://fimi.cs.helsinki.fi/data/).

Fig. 5. Scalability with respect to the number of items.



8.2 Number of Items 

Next, we scaled the number of items using 1,000 transactions. The run times for
5 to 100 items can be seen in Figure 5(a), which shows the expected exponential
run time inherent in FIM problems. It can be clearly seen that the ProFP-Growth
approach vastly outperforms ProApriori.

Figure 5(b) shows the number of nodes used in the ProFP-Tree. Except for
very few items, the number of nodes in the tree grows linearly.



8.3 Effect of Uncertainty and Certainty 

In this experiment, we set the number of transactions to 1,000 and the number
of items to 20 and varied the parameters $P_0(x)$ and $P_1(x)$.

For the experiment shown in Figure 6(a), we fixed the probability that items
are uncertain ($1 - P_0(x) - P_1(x)$) at 0.3 and successively increased $P_1(x)$
from 0 (which means that no items exist for certain) to 0.7. It can be observed
that the number of nodes initially increases. This is what we would expect, since
more items existing in the database increase the number of nodes required.
However, as the number of certain items increases, an opposite effect reduces the
number of nodes in the tree. This effect is caused by the increasing overlap of
the transactions, in particular the increased number and length of shared
prefixes. When $P_1(x)$ reaches 0.7 (and thus $P_0(x) = 0$), each item is contained
in each transaction with a probability greater than zero, and thus all
transactions contain the same items with non-zero probability. In this case, the
ProFP-Tree degenerates to a linear list containing exactly one node for each
item. Note that the size of the uncertain item lookup table is constant, since the
expected number of uncertain items is constant at
$0.3 \cdot |T| \cdot |I| = 0.3 \cdot 1{,}000 \cdot 20 = 6{,}000$.

Fig. 6. Effect of certainty and uncertainty on the ProFP-Tree size and uncertain
item lookup table.

In Figure 6(b), we fixed $P_1(x)$ at 0.2 and successively decreased $P_0(x)$ from
0.8 to 0, thus increasing the probability that items are uncertain from 0 to 0.8.
We see a similar pattern as in Figure 6(a) for the number of nodes, for similar
reasons. As expected here, the size of the lookup table increases as the number
of uncertain items increases.
8.4 Effect of minSup



Here, we varied the minimum support threshold minSup using an artificial
database of 10,000 transactions and 20 items. Figure 7 shows the results. For
low values of minSup, both algorithms have a high run time due to the large
number of probabilistic frequent itemsets. It can be observed that ProFP-Growth
significantly outperforms ProApriori for all settings of minSup.

9 Conclusion 

The Probabilistic Frequent Itemset Mining (PFIM) problem is to find itemsets
in an uncertain transaction database that are (highly) likely to be frequent.
This problem has two components: efficiently computing the support probability
distribution and frequentness probability, and efficiently mining all probabilistic
frequent itemsets. To solve the first problem in linear time, we proposed a novel
method based on generating functions. To solve the second problem, we proposed
the first probabilistic frequent pattern tree and pattern growth algorithm. We
demonstrated that this significantly outperforms the current state-of-the-art
approach to PFIM.